Dataset

The data set contains 4,898 white wines with 11 variables quantifying the chemical properties of each wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).

Reference: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.

Input variables (based on physicochemical tests):

  1. fixed acidity (tartaric acid - g / dm^3)
  2. volatile acidity (acetic acid - g / dm^3)
  3. citric acid (g / dm^3)
  4. residual sugar (g / dm^3)
  5. chlorides (sodium chloride - g / dm^3)
  6. free sulfur dioxide (mg / dm^3)
  7. total sulfur dioxide (mg / dm^3)
  8. density (g / cm^3)
  9. pH
  10. sulphates (potassium sulphate - g / dm^3)
  11. alcohol (% by volume)

Output variable (based on sensory data):

  1. quality (score between 0 and 10)

Univariate Plots Section

We start by looking at histograms for every variable. Histograms are a simple way to provide an overview of the individual variable distributions. (Quality will be investigated separately.)

Most variables are symmetrically distributed with high peaks. Residual sugar and alcohol are more right skewed. Most variables seem to have outliers on the upper scale (vol.acid, citric, sugar, chlorides, freeSO2, density). We will be mostly interested in examining relationships between quality and other variables.

We see a few distributions with long tails or outliers. To get a more detailed view on the distributions, we will look at the individual histograms. Before we do so, we take a look at the distribution of the wine quality:

We notice that the vast majority of wines were assigned a rating between 5 and 7. There are no wines with a rating below 3 or a rating of 10:

## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5
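The share of wines in the 5 to 7 range can be checked directly from the counts above. This is a minimal sketch in base R; the names `quality_counts` and `share_5_to_7` are ours and do not come from the original analysis code.

```r
# Counts per quality rating, copied from the table above
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "9" = 5)

total_wines <- sum(quality_counts)   # 4898 wines in total

# Fraction of wines rated 5, 6 or 7
share_5_to_7 <- sum(quality_counts[c("5", "6", "7")]) / total_wines

round(share_5_to_7, 3)   # roughly 0.926
```

So a little under 93% of all wines fall into the three middle ratings.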

A first look at the distributions revealed several variables with potential outliers. Therefore, we replot individual histograms of the variables that showed long-tailed distributions. Red lines enclose the central 50% of the data. In the following, we further investigate the distributions of citric acid, residual sugar, chlorides, free sulfur dioxide and density, because the overview above doesn’t clearly reveal the behavior at the upper end of the tails.

For citric acid, removing the upper 1% of the data results in a fairly symmetric, approximately normal distribution. Tails are quite long on both sides. There is a smaller peak on the right at about 0.75 (in fact, 41 wines show a citric acid value of 0.74).

As sugar showed a very skewed distribution with a long tail it is hard to see the actual shape of the distribution. We get a better grasp of the data when plotting it on a logarithmic scale:

At least 25% of the data falls below a value of 2 g / dm^3 (the first quartile is 1.7). The remaining data goes up to about 30. There is one apparent outlier above 50.

Chlorides show a very long upper tail. Again, a logarithmic scale…

…can reveal further insight. We can see how the top 25% stretch out all the way to values of 0.3 and higher, while the remaining 75% stay at or below 0.05.

For free sulfur dioxide the histogram of the full data was driven by one outlier. Removing it gives the following histogram:

More precisely, we removed the top 0.1% and included a blue line that shows the 99% quantile. We can easily see that the top 1% of the free sulfur dioxide values produce the long tail of an otherwise symmetric distribution. The 99% quantile has a value of 81. The maximum value is 289, which makes it an extreme outlier.
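Trimming by an upper quantile, as described above, can be sketched in a few lines of base R. The helper `trim_upper` and the simulated vector `free_so2` are assumptions for illustration only; they are not the original code or the real data.

```r
# Hypothetical helper: keep only values at or below the given upper quantile
trim_upper <- function(x, prob) {
  x[x <= quantile(x, probs = prob, names = FALSE)]
}

set.seed(1)
# Simulated stand-in: a roughly symmetric bulk plus one extreme value
free_so2 <- c(rnorm(999, mean = 35, sd = 15), 289)

trimmed <- trim_upper(free_so2, 0.999)   # drops the top 0.1%
q99 <- quantile(free_so2, probs = 0.99, names = FALSE)

length(free_so2) - length(trimmed)   # number of removed values
max(trimmed)                         # far below the extreme value of 289
q99                                  # position of the 99% quantile line
```

A histogram of `trimmed` with a vertical line at `q99` would then correspond to the kind of plot described above.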

Lastly, we have a closer look at the distribution of the density variable. Here, it is enough to remove the two highest values (1.0103 and 1.03898) to allow a better view of the distribution:

We see that density is not as peaked as it appeared at first sight.

In order to get more information about potential outliers we compute means and quartiles.

##     fix.acid         vol.acid          citric           sugar       
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides          freeSO2          totalSO2        density      
##  Min.   :0.00900   Min.   :  2.00   Min.   :  9.0   Min.   :0.9871  
##  1st Qu.:0.03600   1st Qu.: 23.00   1st Qu.:108.0   1st Qu.:0.9917  
##  Median :0.04300   Median : 34.00   Median :134.0   Median :0.9937  
##  Mean   :0.04577   Mean   : 35.31   Mean   :138.4   Mean   :0.9940  
##  3rd Qu.:0.05000   3rd Qu.: 46.00   3rd Qu.:167.0   3rd Qu.:0.9961  
##  Max.   :0.34600   Max.   :289.00   Max.   :440.0   Max.   :1.0390  
##        pH          sulphates         alcohol         quality     
##  Min.   :2.720   Min.   :0.2200   Min.   : 8.00   Min.   :3.000  
##  1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50   1st Qu.:5.000  
##  Median :3.180   Median :0.4700   Median :10.40   Median :6.000  
##  Mean   :3.188   Mean   :0.4898   Mean   :10.51   Mean   :5.878  
##  3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :3.820   Max.   :1.0800   Max.   :14.20   Max.   :9.000

Means and medians are usually very close, supporting our observation of mostly symmetric and peaked distributions. All variables show a narrow interquartile range (IQR). Their maximum values, on the other hand, are quite extreme. We will continue our analysis of some of these extreme values in the next section.

Summary Univariate Analysis

What is the structure of the dataset? Did you create any new variables from existing variables in the dataset? Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data?

The data set contains 4,898 white wines with 11 variables quantifying the chemical properties of each wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).

We notice that the vast majority of wines were assigned a rating between 5 and 7. There are no wines with a rating below 3 or a rating of 10.

The chemical attributes show mostly symmetric and peaked distributions. Exceptions include the variables for residual sugar content and alcohol. Chemical attributes are naturally bounded below (usually by 0) whereas there can be more variability on the upper limit, which leads to longer upper tails (e.g. free sulfur dioxide or chlorides). Except for alcohol, all variables contain quite a few extremely high values.

What is/are the main feature(s) of interest in your dataset?

Naturally, we are interested in identifying the chemical properties of the white wines that could have influenced the quality rating. We will try to detect relationships between the rating (variable “quality”) and the variables describing the chemical properties.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

(High) quality is probably not influenced by a single variable but rather by an (optimal?) combination of chemical properties. Thus, it might be interesting to investigate not only bivariate but also multivariate relationships.

Bivariate Plots Section

We continue our analysis regarding outliers.

To measure how far the maximum values deviate from the majority of values we compute how many multiples of the IQR the maximum is away from Q3:

##  fix.acid  vol.acid    citric     sugar chlorides   freeSO2  totalSO2 
##  6.900000  7.090909 10.583333  6.817073 21.142857 10.565217  4.627119 
##   density        pH sulphates   alcohol   quality 
##  9.795545  2.842105  3.785714  1.473684  3.000000

The highest chlorides value is even more than 20 times the IQR away from the third quartile. Also, all other variables except for alcohol show outliers.
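The measure used here is (maximum - Q3) / IQR. As a quick check, the sketch below recomputes it for four variables from the quartiles and maxima printed in the summary table; the vector names are ours. Density is left out because the rounded summary values would not reproduce its figure exactly.

```r
# Quartiles and maxima copied from the summary table in the univariate section
q1 <- c(citric = 0.27, chlorides = 0.036, freeSO2 = 23,  alcohol = 9.5)
q3 <- c(citric = 0.39, chlorides = 0.050, freeSO2 = 46,  alcohol = 11.4)
mx <- c(citric = 1.66, chlorides = 0.346, freeSO2 = 289, alcohol = 14.2)

# Distance of the maximum from the third quartile, in multiples of the IQR
iqr_multiples <- (mx - q3) / (q3 - q1)
round(iqr_multiples, 2)
```

For a full data frame, the same quantity could be obtained per column with `max()`, `quantile()` and `IQR()`.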

We start with the four variables with the most extreme outliers according to the measure above:

  1. chlorides,
  2. citric,
  3. freeSO2,
  4. density.

We produce boxplots with and without outliers across the different quality ratings. This comparison helps to determine whether the outliers show the same relationship with quality as the less extreme values. If this were not the case, we might want to remove the outliers to strengthen the relationship between a variable and the wine quality (which is our main area of interest).

For our purposes, an outlier is defined as a value falling outside the interval [Q1 - 2 x IQR , Q3 + 2 x IQR].
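This outlier rule translates directly into code. The following is a minimal sketch; the function name `is_outlier` and the toy vector are assumptions for illustration, and R's default quantile definition is used.

```r
# Flag values outside [Q1 - k * IQR, Q3 + k * IQR]; k = 2 as defined above
is_outlier <- function(x, k = 2) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  x < q[1] - k * iqr | x > q[2] + k * iqr
}

# Toy data: Q1 = 3.25, Q3 = 7.75, IQR = 4.5, so the upper bound is 16.75
toy <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)
toy[is_outlier(toy)]   # only the value 30 is flagged
```

Note that the factor 2 is wider than the factor 1.5 that standard boxplots use for their whiskers, so fewer points are flagged.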

  1. Chlorides:

There are 172 data points with chloride values more than 2 times the interquartile range above the third quartile, most of them with a quality rating of 5 or 6. Deleting these outliers would strengthen the correlation between quality and chlorides from -0.2099344 to -0.2767492, which makes deletion tempting. Nevertheless, as we don’t have enough information about how the data were generated, we abstain from deleting that many values.

  2. Citric acid:

There are 125 outliers on the upper scale. Deleting them would change the correlation only slightly, from -0.0092091 to 0.0184377. The correlation between the two variables is very weak either way, so we decide to keep all values.

  3. Free sulfur dioxide:

There are 26 outliers on the upper scale. Deleting them would strengthen the correlation from 0.0081581 to 0.0308116. One outlier is particularly extreme. However, as it also coincides with an ‘extreme’ rating, it probably makes sense to keep it.

  4. Density:

There are 3 outliers on the upper scale. Deleting them would strengthen the correlation from -0.3071233 to -0.3172324. As with free sulfur dioxide, there is one particularly extreme value. In contrast to the outlier before, this extreme value doesn’t coincide with an ‘extreme’ quality rating. Therefore, we decide to discard the three highest values, as they contradict the otherwise quite strong (negative) correlation with quality.
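The before/after comparison used for each of the four variables can be sketched as follows. The helper `cor_without_upper_outliers` and the toy data are hypothetical and only illustrate the mechanics; the figures reported above come from the real data.

```r
# Correlation on all rows versus rows where x is not an upper outlier
# (upper outlier: x > Q3 + k * IQR, with k = 2 as in the definition above)
cor_without_upper_outliers <- function(x, y, k = 2) {
  upper <- quantile(x, probs = 0.75, names = FALSE) + k * IQR(x)
  keep <- x <= upper
  c(all = cor(x, y), trimmed = cor(x[keep], y[keep]), removed = sum(!keep))
}

# Toy data: a perfect negative trend plus one point that contradicts it
x <- c(1:10, 40)
y <- c(10:1, 10)
cor_without_upper_outliers(x, y)
```

In this toy case a single contradicting point hides an otherwise perfect negative correlation, which is the same argument that motivates discarding the three density values.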

After removing the three density outliers, we focus on correlations and dependencies among the variables.

In order to get an overview of all bivariate relationships, we use ggpairs on a subsample:

The correlation coefficients of all variables with quality, computed on the full dataset, are as follows:

##  [1] "fix.acid"  "vol.acid"  "citric"    "sugar"     "chlorides"
##  [6] "freeSO2"   "totalSO2"  "density"   "pH"        "sulphates"
## [11] "alcohol"   "quality"
##  [1] -0.113815100 -0.195886513 -0.009250724 -0.100114240 -0.210031177
##  [6]  0.008206527 -0.174835130 -0.317232428  0.099423895  0.053710048
## [11]  0.435842693  1.000000000

We can observe the strongest correlations with quality for alcohol (0.436) and density (-0.317). The pairwise plots show that alcohol is strongly correlated with density (approx. -0.8) and also with residual sugar (approx. -0.46). Volatile acidity and chlorides give correlation coefficients with quality of about -0.2. Total sulfur dioxide gives a correlation coefficient of -0.175. The weakest correlations are found for free sulfur dioxide (0.008) and, as we saw earlier, citric acid (-0.009). The remaining coefficients are roughly 0.1 or smaller in absolute value.

In the next sections, we will proceed as follows: before we examine relationships between wine quality and other variables (our main interest), we investigate relations among the chemical attributes. We will focus on alcohol, residual sugar and density because the correlations among them are particularly strong and can be explained intuitively.

Let’s have a closer look at these variables, where we can expect a clearly visible relation. As the linear correlation coefficients are quite large, we will produce scatter plots including a regression line to visualize the linear correlation:
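A vector of correlations with quality like the one printed above can be read off a correlation matrix. The sketch below uses a small simulated data frame as a stand-in; the object `wines`, its three columns and all simulated values are assumptions for illustration and are not the real data.

```r
set.seed(42)
# Simulated stand-in for the wine data (not the real measurements)
n <- 200
alcohol <- rnorm(n, mean = 10.5, sd = 1.2)
wines <- data.frame(
  alcohol = alcohol,
  density = 1.01 - 0.0015 * alcohol + rnorm(n, sd = 0.001),
  quality = round(6 + 0.4 * (alcohol - 10.5) + rnorm(n, sd = 0.8))
)

# Pearson correlation of every column with quality
cor_with_quality <- cor(wines)[, "quality"]

# Sort by absolute size to see the strongest relationships first
cor_with_quality[order(abs(cor_with_quality), decreasing = TRUE)]
```

Sorting by absolute value is useful here because negative coefficients such as the one for density would otherwise end up at the bottom of the list.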

## [1] "correlation coefficient"
## [1] -0.8041518

## [1] "correlation coefficient"
## [1] 0.8320888

As the density of alcohol is lower than the density of water, we can observe a very linear relationship between alcohol and density, with a correlation coefficient of -0.804. Sugar, on the other hand, increases the density of water/wine, so we see a similarly linear relationship with a clear upward trend (correlation coefficient of +0.832). Interestingly, this relationship doesn’t seem to hold for low sugar contents. But we have to keep in mind that the influence of sugar content grows with increasing values.
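The regression lines in such scatter plots are simple linear fits. As a sketch of the idea, the following base R code fits both lines on simulated data; all coefficients used to generate the data are made up for illustration and are not estimates from the wine data.

```r
set.seed(7)
# Simulated data only: density falls with alcohol and rises with sugar
alcohol <- runif(300, min = 8, max = 14)
sugar   <- runif(300, min = 1, max = 20)
density <- 1.005 - 0.0012 * alcohol + 0.0004 * sugar + rnorm(300, sd = 0.0005)

fit_alcohol <- lm(density ~ alcohol)
fit_sugar   <- lm(density ~ sugar)

slope_alcohol <- coef(fit_alcohol)[["alcohol"]]   # negative slope
slope_sugar   <- coef(fit_sugar)[["sugar"]]       # positive slope

c(slope_alcohol = slope_alcohol, slope_sugar = slope_sugar)
c(cor(alcohol, density), cor(sugar, density))
```

The sign of each slope always matches the sign of the corresponding correlation coefficient, which is why a downward line goes with -0.804 and an upward line with +0.832. In ggplot2, `geom_smooth(method = "lm")` draws such a line on top of a scatter plot.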

In the section on multivariate plots, we will demonstrate that on the same (low) sugar level, density varies with alcohol.

Another combination of variables that could be interesting is alcohol and sugar as both play a crucial role in the wine production. During the fermentation process sugar is transformed into alcohol, so high amounts of residual(!) sugar may indicate an early stop of the fermentation process which would lead to a lower alcohol content. The actual behavior is shown in the next scatterplot.

Although we obtain a correlation coefficient of -0.4591654 (supporting our assumption), we see a diffuse pattern. Especially for low amounts of residual sugar, there seems to be no or only little influence on the alcohol content. However, in the section on multivariate plots, we will be able to identify interactions between alcohol and sugar when it comes to the wine quality!

After looking only at chemical attributes, we now investigate the relationship between quality and other variables. We will focus on the four variables that, apart from density, show the strongest correlation with quality:

  1. alcohol,
  2. chlorides,
  3. total sulfur dioxide and
  4. volatile acidity.

Boxplots for every quality level may reveal any possible patterns between a variable and the wine quality. (Since quality is a factor variable, boxplots are more appropriate than scatterplots as they do not require scaling.)

We start with alcohol because it showed the strongest linear correlation with quality.

  1. Alcohol:

## Source: local data frame [7 x 3]
## 
##   quality mean(alcohol) median(alcohol)
## 1       3      10.34500           10.45
## 2       4      10.15245           10.10
## 3       5       9.80884            9.50
## 4       6      10.57648           10.50
## 5       7      11.36794           11.40
## 6       8      11.63600           12.00
## 7       9      12.18000           12.50

Looking at the means and medians, we can see a roughly linear increase for quality ratings of 5 or higher. The range of alcohol for a given quality rating is quite wide and overlaps with the ranges for other quality ratings.
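Per-rating summaries like the table above can be produced with a grouped aggregation. The printed output suggests dplyr was used; the sketch below shows a base R equivalent on a tiny made-up data frame, so the object `wines` and its values are assumptions for illustration.

```r
# Toy data frame (made-up values), using the column names from the output above
wines <- data.frame(
  quality = c(5, 5, 5, 6, 6, 7, 7, 7),
  alcohol = c(9.2, 9.5, 10.1, 10.4, 10.6, 11.0, 11.4, 12.1)
)

# Mean and median of alcohol within each quality rating
by_quality <- data.frame(
  quality        = sort(unique(wines$quality)),
  mean_alcohol   = as.vector(tapply(wines$alcohol, wines$quality, mean)),
  median_alcohol = as.vector(tapply(wines$alcohol, wines$quality, median))
)
by_quality
```

Comparing the mean and median columns row by row also gives a quick impression of skew within each rating.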

  2. Chlorides:

## Source: local data frame [7 x 3]
## 
##   quality mean(chlorides) median(chlorides)
## 1       3      0.05430000             0.041
## 2       4      0.05009816             0.046
## 3       5      0.05154633             0.047
## 4       6      0.04519727             0.043
## 5       7      0.03819091             0.037
## 6       8      0.03831429             0.036
## 7       9      0.02740000             0.031

Means for chlorides (boxplots on a log scale) are almost strictly decreasing with increasing quality. Medians, again, show a roughly linear relationship for quality ratings of 5 and higher. The IQRs of chlorides overlap for different quality ratings.

  3. Total sulfur dioxide:

## Source: local data frame [7 x 3]
## 
##   quality mean(totalSO2) median(totalSO2)
## 1       3       170.6000            159.5
## 2       4       125.2791            117.0
## 3       5       150.9046            151.0
## 4       6       137.0014            132.0
## 5       7       125.1148            122.0
## 6       8       126.1657            122.0
## 7       9       116.0000            119.0

The IQRs overlap as in the previous plots. Once again, we can observe a break in the means and medians of total sulfur dioxide between quality ratings 4 and 5. The same holds for the following plot.

  4. Volatile acidity:

## Source: local data frame [7 x 3]
## 
##   quality mean(vol.acid) median(vol.acid)
## 1       3       0.333250             0.26
## 2       4       0.381227             0.32
## 3       5       0.302011             0.28
## 4       6       0.260180             0.25
## 5       7       0.262767             0.25
## 6       8       0.277400             0.26
## 7       9       0.298000             0.27

For the four variables examined above, we could observe different behaviors for quality ratings above and below 5. That is why we group the quality ratings 3 to 5 and assign them their median value 5 (mean: 4.8762195). Why this might be useful can be seen in the following boxplot (compare with 1. above).

## Source: local data frame [5 x 3]
## 
##   quality.grouped mean(alcohol) median(alcohol)
## 1               5       9.84953             9.6
## 2               6      10.57648            10.5
## 3               7      11.36794            11.4
## 4               8      11.63600            12.0
## 5               9      12.18000            12.5

We can see that grouping the lower quality ratings into a single rating has the nice effect that means and medians are now strictly monotonic. The overlap of the IQRs lessens. Nevertheless, the monotonicity only holds on average.
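The grouping of ratings 3 to 5 and the median and mean quoted for it can be reproduced from the rating counts alone. A minimal sketch; `group_quality` is a hypothetical helper name, and `pmax()` is just one simple way to collapse the low ratings.

```r
# Counts for the ratings 3, 4 and 5, taken from the quality table
low_counts <- c("3" = 20, "4" = 163, "5" = 1457)
low_ratings <- rep(as.numeric(names(low_counts)), times = low_counts)

median(low_ratings)   # 5
mean(low_ratings)     # about 4.876

# Collapse ratings 3 to 5 into the single level 5; higher ratings are unchanged
group_quality <- function(quality) pmax(quality, 5)
group_quality(c(3, 4, 5, 6, 9))   # 5 5 5 6 9
```

Because 1,457 of the 1,640 low-rated wines already have a rating of 5, both the median and the mean of the merged group sit at or very close to 5.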

Another example is the following boxplot (cf. 2.):

## Source: local data frame [5 x 3]
## 
##   quality.grouped mean(chlorides) median(chlorides)
## 1               5      0.05143598             0.047
## 2               6      0.04519727             0.043
## 3               7      0.03819091             0.037
## 4               8      0.03831429             0.036
## 5               9      0.02740000             0.031

For chlorides the same effect can be produced at least for the median values. For (3.) total sulfur dioxide…

## Source: local data frame [5 x 3]
## 
##   quality.grouped mean(totalSO2) median(totalSO2)
## 1               5       148.5979              149
## 2               6       137.0014              132
## 3               7       125.1148              122
## 4               8       126.1657              122
## 5               9       116.0000              119

and (4.) volatile acidity…

## Source: local data frame [5 x 3]
## 
##   quality.grouped mean(vol.acid) median(vol.acid)
## 1               5      0.3102652             0.29
## 2               6      0.2601800             0.25
## 3               7      0.2627670             0.25
## 4               8      0.2774000             0.26
## 5               9      0.2980000             0.27

…it doesn’t work that well, but the trend becomes more visible. For total sulfur dioxide we could think about combining qualities of 8 and 9, but we don’t want to follow this path here.

Summary Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

We can observe some linear relationships between quality and other variables, in particular alcohol, chlorides, total sulfur dioxide and volatile acidity. On average, we can produce strictly monotonic relationships for at least two variables (alcohol and chlorides). In all cases, quality doesn’t separate the levels of a chemical variable into distinct groups. Monotonicity can only be achieved on average.

We decided to delete three outliers as they showed a deviation from the observed relationship between quality and density.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

There are clear (linear) relationships between density and alcohol and between density and sugar. We observed that for low sugar content, the influence of alcohol on the density becomes stronger.

What was the strongest relationship you found?

The strongest (and also most obvious) relationship is the one between residual sugar and density. Also, alcohol and density are strongly correlated, even though residual sugar has the stronger influence (because “adding” alcohol cannot lower the density below the density of alcohol itself).

The strongest relationship between quality and another variable is found for alcohol.

We will look at relationships of more than two variables in the next section.

Multivariate Plots Section

Before we start examining the influence of other variables on the wine quality, we summarize the relationship between alcohol, sugar and density. In the bivariate plots section we have seen that alcohol and sugar each show a strong linear relationship with density. Also, we have seen that the linear relationship is strongest for high sugar content. Here, we would like to visualize the relationship of all three variables:

From the color coding, we can easily infer that density is highest when the amount of residual sugar is high and alcohol is low (and vice versa). For a fixed sugar level, density varies with the alcohol content. For a given alcohol level, density increases with increasing sugar content. The linear correlation between alcohol and sugar is weaker than the correlations of alcohol and sugar with density because there are a lot of wines with low sugar content; in fact, half of the wines have a sugar content below 5.2 (black line; we stretched out the low sugar levels using a log scale). If the sugar content is close to zero, the influence of other variables (in particular alcohol) on density dominates. We can visualize this effect by zooming in:

Let’s turn our attention back to the wine quality. In this section we are looking for interactions between the chemical attributes influencing the wine quality. So far, we found significant relationships for 1. alcohol, 2. chlorides, 3. total sulfur dioxide and 4. volatile acidity with the white wine quality. Now, we would like to investigate how other variables (possibly) influence these relationships.

Chlorides represent the amount of salt. We have seen that very high levels of chlorides tend to go hand in hand with lower quality. This might be offset by other variables. Two that come to mind are the amount of residual sugar (adding sweetness) and citric acid (which can add “freshness” and flavor to the wine). Therefore, we produce scatterplots of a) chlorides vs. sugar and b) chlorides vs. citric acid (medians in blue) for every quality level. Interactions would result in different (or “evolving”) patterns along with quality that are not solely driven by a vertical shift.

We use the grouped quality assignment.

1.a)

We cannot identify any clear interactions by adding sugar to our analysis.

1.b) Let us look at citric acid. We cut off the upper and lower 5% to allow for better visibility:

We cannot identify any additional interactions by including citric acid into our analysis.

Second, we want to investigate interactions between the alcohol and sugar regarding the wine quality. We expect to find interactions because both variables are part of the fermentation process (see bivariate plots section). Again, we use a logarithmic scale for sugar because the distribution is heavily right skewed (see first section):

We fitted second degree (natural cubic) B-splines to reveal trends more clearly. On average, quality increases with alcohol and decreases with residual sugar content (very high sugar content in particular is more often found for wines with a quality of 6 or less) - so far so good. However, for sugar content between 3 and 10 (estimate), alcohol increases more strongly with quality than for lower sugar levels. For higher sugar contents, alcohol even seems to decrease on average.

To display the different behavior for high sugar levels, we “cut” the variable into four groups (0,2], (2,6], (6,10] and (10,max] and plot the distribution of alcohol across the wine quality:

We see that for medium sugar levels between 2 and 10, alcohol levels increase a little more than for lower levels. For sugar levels between 2 and 6, the median alcohol content is already higher than 12 for a quality of 7. Remember that, considering only alcohol and quality, such a high median alcohol level wasn’t observed for qualities below 8. Here, alcohol increases even further for qualities of 8 and 9 in that range of sugar values. Even more surprisingly, for higher sugar levels the positive relationship (again: see bivariate section) is reversed.
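The four sugar groups can be created with base R's `cut()`, which uses right-closed intervals by default and therefore matches the (0,2], (2,6], (6,10], (10,max] notation. The toy vector below is an assumption for illustration; only its largest value, 65.8, is taken from the summary table.

```r
# Toy sugar values in g / dm^3; 65.8 is the maximum from the summary table
sugar <- c(1.0, 2.0, 3.5, 6.0, 8.2, 10.0, 12.4, 65.8)

# Right-closed bins: (0,2], (2,6], (6,10], (10,max]
sugar_group <- cut(sugar, breaks = c(0, 2, 6, 10, max(sugar)))

table(sugar_group)   # two toy values fall into each of the four groups
```

Values that sit exactly on a break, such as 2.0 or 10.0, are assigned to the lower group, so a wine with a sugar content of exactly 10 counts as medium rather than high.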

Next, we have a look at our third variable of interest: volatile acidity. High levels of volatile acidity are often associated with a vinegary taste. We give a representative example of volatile acidity combined with another variable:

Very high amounts of volatile acidity are correlated with low quality (as we saw before). Looking at wines with quality 4 (for example), we see that high levels of volatile acidity cannot be offset by adding “freshness” in the form of citric acid. We couldn’t find other variables that would do the job.

Finally, we examine total sulfur dioxide and citric (similar for other variables):

As the data is too clustered, we cut off the upper and lower 5% quantiles:

For quality 3 and 9 there are not enough data points to be confident in the pattern. It would be interesting to have more wines at the ends of the quality scale in the dataset in order to determine if the interactions are persistent for high or low quality wines. Across the other quality levels, we cannot detect any strong interactions.

Let us formulate our findings from this section. Afterwards we give a brief overall summary.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

We focused on the four variables from the bivariate analysis (alcohol, chlorides, total sulfur dioxide and volatile acidity). We investigated their relationship with quality and tried to identify interactions with other variables. Only for alcohol are we confident that we have identified some interactions with sugar. Alcohol and quality show different patterns, especially for medium (2-10) and high (>10) sugar levels.

Were there any interesting or surprising interactions between features?

The alcohol distribution for white wines with high sugar levels is quite different from the one for lower sugar levels. This wasn’t expected.


Final Plots and Summary

Plot One

Description One

Most white wines obtain a rating between 5 and 7. Only very few ratings of 3 and 4 or 8 and 9 are assigned. There are no ratings below 3 and no wine is rated 10. One might be more interested in determining how to identify a very good wine instead of a wine of average quality. So, it would have been useful to have more wines of quality 9 (or even 10). For some analyses it can be helpful to combine ratings of 3 to 5 into one group:

Plot Two

Description Two

With the lower ratings grouped, medians of the variable chlorides are strictly decreasing with increasing wine quality. The IQRs of chlorides of course still overlap for different quality ratings. Most outliers are found for low and medium ratings. The correlation coefficient between quality and chlorides is -0.2100312, and at -0.2168668 it is even a little stronger when grouping the lower ratings. Apart from density, chlorides show the strongest (linear) relationship with quality after alcohol.

Plot Three

We fitted second degree (natural cubic) B-splines to reveal trends more clearly. On average, quality increases with alcohol (correlation coefficient: 0.4358427) and decreases with residual sugar content (especially very high sugar content is more often found for wines with quality of 6 or less).

For sugar content between 3 and 10 (estimate), alcohol increases more strongly with quality than for lower sugar levels. For higher sugar contents, alcohol even seems to decrease on average.


Reflection

The dataset contains almost 5000 white wines, each rated by at least three experts. Eleven chemical attributes such as sulfur content, pH level etc. are listed.

In general, there are no striking linear correlations between wine quality and its chemical properties. Our visualizations suggest that at least alcohol and chlorides are significantly correlated to the quality of white wines. It is helpful to look at the full range of chemical variables, e.g. very high amounts of chlorides or volatile acidity seem to have a negative impact on the quality. Wine quality is centered around medium ratings and ratings of 9 are rare. There are no wines that obtained a rating of 10. Therefore, we focused on overall trends. Among the other variables strong relationships can be found (and easily explained, e.g. more sugar increases the density).

Weaker relationships between the quality and the chemical attributes could be found. This is hardly surprising because we can hardly expect to perfectly model human taste (a very complex and only poorly understood sense) with only eleven variables.

More information can be extracted by looking at combinations of chemical variables. Here, we found the combination of alcohol and sugar to give further insight. Other combinations did not show interactions.

Our analysis suggests four main points:

For a better understanding of wine quality more chemical properties are needed. As human taste is very complex, it might be useful to include other non-chemical attributes, like wine type, location, hours of sunshine etc.

Also, I think it would be very interesting to include price as a variable. This could give the analysis a whole new perspective.